1 Youth Risk Behavior Surveillance

Every two years, the Centers for Disease Control and Prevention (CDC) conducts the Youth Risk Behavior Surveillance System (YRBSS) survey, collecting data from high schoolers (9th through 12th grade) to analyze health patterns. We will work with a selected group of variables from a random sample of observations from one of the years in which the YRBSS was conducted.

1.1 Load and skim the data

This data is part of the openintro package, so we can load and inspect it. There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file: ?yrbss

data(yrbss)
glimpse(yrbss)  #13 variables
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, …
## $ gender                   <chr> "female", "female", "female", "female", "fem…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9",…
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not…
## $ race                     <chr> "Black or African American", "Black or Afric…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, …
## $ weight                   <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2,…
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not…
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did no…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7,…
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+"…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7,…
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5…
skimr::skim(yrbss)  #get a feel for missing values, summary statistics of numerical variables, and a very rough histogram
Data summary
Name yrbss
Number of rows 13583
Number of columns 13
_______________________
Column type frequency:
character 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 12 1.00 4 6 0 2 0
grade 79 0.99 1 5 0 5 0
hispanic 231 0.98 3 8 0 2 0
race 2805 0.79 5 41 0 5 0
helmet_12m 311 0.98 5 12 0 6 0
text_while_driving_30d 918 0.93 1 13 0 8 0
hours_tv_per_school_day 338 0.98 1 12 0 7 0
school_night_hours_sleep 1248 0.91 1 3 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 77 0.99 16.16 1.26 12.00 15.0 16.00 17.00 18.00 ▁▂▅▅▇
height 1004 0.93 1.69 0.10 1.27 1.6 1.68 1.78 2.11 ▁▅▇▃▁
weight 1004 0.93 67.91 16.90 29.94 56.2 64.41 76.20 180.99 ▆▇▂▁▁
physically_active_7d 273 0.98 3.90 2.56 0.00 2.0 4.00 7.00 7.00 ▆▂▅▃▇
strength_training_7d 1176 0.91 2.95 2.58 0.00 0.0 3.00 5.00 7.00 ▇▂▅▂▅

1.2 Exploratory Data Analysis

1.2.1 Analyzing the weight of participants in kilograms

1. From skimr::skim() above, we can see there are 1,004 missing observations in weight.

2. Using visualization and summary statistics to describe the distribution of weights

yrbss %>% 
  summarise(mean_weight = mean(weight, na.rm=TRUE), #mean is 67.9065
            median_weight=median(weight, na.rm = TRUE)) #median is 64.41
## # A tibble: 1 x 2
##   mean_weight median_weight
##         <dbl>         <dbl>
## 1        67.9          64.4
yrbss %>% 
  filter(!is.na(weight)) %>% #remove the missing observations
  #plotting the distribution
  ggplot(mapping = aes(x=weight)) +
    geom_density() +
   #calculating the mean and median of weight and plotting them onto the distribution, mean = 67.91, median = 64.41
    geom_vline(xintercept = mean(yrbss$weight,na.rm=TRUE), colour = "lightslateblue")+
    geom_vline(xintercept = median(yrbss$weight, na.rm = TRUE), colour = "tomato") +
    annotate("text", x = 80, y = 0.031, label = "Mean=67.91", colour = "lightslateblue")+
    annotate("text", x = 50, y = 0.031, label = "Median=64.41", colour = "tomato")+
    labs(x = "Weight in Kilograms",
         y= "Density of observations",
         title = "Right-Skewed Distribution of High Schoolers' Weights",
         subtitle = "Most high schoolers' weights are concentrated around 60kg, \n but a significant number lie far above that average")+
  theme_economist_white()+
  NULL

3. Possible relationship between a high schooler’s weight and their physical activity

First, we create a new variable, physical_3plus, which is yes if a student is physically active for at least 3 days a week and no otherwise.

About 66.9% of high schoolers are physically active for at least 3 days a week.

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(physically_active_7d >= 3, "yes", "no"))

yrbss %>% filter(!is.na(physical_3plus)) %>% 
  group_by(physical_3plus) %>% 
  summarise(count = n()) %>% 
  mutate(prop= count/sum(count))
## # A tibble: 2 x 3
##   physical_3plus count  prop
##   <chr>          <int> <dbl>
## 1 no              4404 0.331
## 2 yes             8906 0.669

The 95% confidence interval for the population proportion of high schoolers who are NOT active 3 or more days per week is (0.323, 0.339).

yrbss %>% 
  filter(!is.na(physical_3plus)) %>% 
  group_by(physical_3plus) %>% 
  summarise(count = n()) %>% 
  mutate(
    prop= count/sum(count), 
    SE_active3 = sqrt(prop*(1-prop)/sum(count)),
    lower95_ci = prop + qnorm(0.025,0,1)*SE_active3, 
    upper95_ci = prop + qnorm(0.975,0,1)*SE_active3) %>% 
  filter(physical_3plus == "no")
## # A tibble: 1 x 6
##   physical_3plus count  prop SE_active3 lower95_ci upper95_ci
##   <chr>          <int> <dbl>      <dbl>      <dbl>      <dbl>
## 1 no              4404 0.331    0.00408      0.323      0.339
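As a sanity check, the same normal-approximation interval can be computed directly in base R from the counts reported above (4,404 "no" out of 13,310 non-missing responses). This is a minimal sketch using only the formula, outside the tidyverse pipeline.

```r
# Normal-approximation 95% CI for a proportion, from the raw counts above
n_no    <- 4404
n_total <- 13310
p_hat   <- n_no / n_total                       # sample proportion, ~0.331
se      <- sqrt(p_hat * (1 - p_hat) / n_total)  # standard error, ~0.0041
ci      <- p_hat + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)                                    # ~ (0.323, 0.339), matching the dplyr result
```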

Make a boxplot of physical_3plus vs. weight

yrbss %>% 
  #making a boxplot
  filter(!is.na(physical_3plus), !is.na(weight)) %>% 
  ggplot(mapping = aes(x=weight, y = physical_3plus)) +
  geom_boxplot() +
  labs(x = "Weight in Kilograms",
       y= "Physically active for at least 3 days a week?",
       title = "Those who exercise are, on average, heavier than those who don't") +
  NULL

We expected to see higher weights for those who exercise less, while the opposite seems to be the case. This might be because weight is not a perfect measure of fitness: it can reflect greater muscle mass, for example, or other characteristics of those who exercise. Another explanation could be that some of the people who responded yes in the survey might be exercising more often precisely to get in better shape. Furthermore, although the average is higher for those who exercise, those who do NOT exercise have higher variance and much larger outliers, which potentially correspond to those who are overweight or obese.

1.3 Confidence Interval

Boxplots show how the medians of the two distributions compare, but we can also compare the means of the distributions using either a confidence interval or a hypothesis test.

yrbss %>%
  group_by(physical_3plus) %>%
  filter(!is.na(physical_3plus)) %>% 
  summarise(mean_weight = mean(weight, na.rm = TRUE),
            sd_weight = sd(weight, na.rm=TRUE),
            count = n(),
            se_weight = sd_weight/sqrt(count),
            t_critical = qt(0.975, count-1), 
            margin_of_error = t_critical * se_weight,
            lower = mean_weight - t_critical * se_weight,
            upper = mean_weight + t_critical * se_weight
            )
## # A tibble: 2 x 9
##   physical_3plus mean_weight sd_weight count se_weight t_critical
##   <chr>                <dbl>     <dbl> <int>     <dbl>      <dbl>
## 1 no                    66.7      17.6  4404     0.266       1.96
## 2 yes                   68.4      16.5  8906     0.175       1.96
## # … with 3 more variables: margin_of_error <dbl>, lower <dbl>, upper <dbl>

There is an observed difference of about 1.77kg (68.44 - 66.67), and we notice that the two confidence intervals do not overlap. The difference appears statistically significant at the 95% confidence level. Let us also conduct a hypothesis test.
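The non-overlap can also be checked by building a confidence interval for the difference itself. Below is a minimal base-R sketch using the rounded summary statistics from the table above; because the inputs are rounded, the endpoints differ slightly from the exact t.test output.

```r
# 95% CI for the difference in mean weight (yes - no), built from the
# rounded per-group summary statistics above (mean, sd, n)
m_no  <- 66.7; s_no  <- 17.6; n_no  <- 4404
m_yes <- 68.4; s_yes <- 16.5; n_yes <- 8906

diff_means <- m_yes - m_no                            # ~1.7 kg
se_diff    <- sqrt(s_no^2 / n_no + s_yes^2 / n_yes)   # Welch standard error
ci <- diff_means + c(-1, 1) * 1.96 * se_diff
round(ci, 2)   # roughly (1.08, 2.32); zero is well outside the interval
```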

1.4 Hypothesis test with formula

The null hypothesis is that mean weights are the same for those who exercise at least 3 days a week and those who don't; the alternative hypothesis is that the mean weights differ.

t.test(weight ~ physical_3plus, data = yrbss)
## 
##  Welch Two Sample t-test
## 
## data:  weight by physical_3plus
## t = -5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.42 -1.12
## sample estimates:
##  mean in group no mean in group yes 
##              66.7              68.4

The test result shows that the p-value is very low (smaller than 1%), so we reject the null hypothesis and conclude that the mean weights of the two groups, those who exercise at least 3 days a week and those who don't, differ in a statistically significant way.

1.5 Hypothesis test with infer

Next, we will introduce a new function, hypothesize, that falls into the infer workflow. You will use this method for conducting hypothesis tests.

# initialize the test, which we save as `obs_diff`
obs_diff <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  
  # the statistic we are searching for is the difference in means, with the order being yes - no != 0 
  calculate(stat = "diff in means", order = c("yes", "no")) 


# simulate the test on the null distribution, which we save as null
null_dist <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  
  # set the null hypothesis as a test for independence, i.e., that there is no difference between the two population means. In one sample cases, the null argument can be set to *point* to test a hypothesis relative to a point estimate
  hypothesize(null = "independence") %>% 
  
  # the `type` argument within generate is set to permute, which is the argument when generating a null distribution for a hypothesis test
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

Visualize this null distribution

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()

Visualising to see how many of these null permutations have a difference at least as large as the obs_stat of 1.77

null_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

Calculating the p-value for the hypothesis test using the function infer::get_p_value()

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

Above is the standard workflow for performing hypothesis tests. Note that a simulated p-value of 0 simply means that none of the 1,000 permutations produced a difference as extreme as the observed 1.77kg; the true p-value is not zero but smaller than 1/1,000.

2 IMDB ratings: Differences between directors

Recall the IMDB ratings data from homework 1. We would like to explore whether the mean IMDB ratings for Steven Spielberg and Tim Burton are the same. Below we calculate the confidence intervals for the mean ratings of these two directors and, as you can see, they overlap. We will reproduce this graph and run a hypothesis test using both the t.test command and the infer package to simulate from a null distribution, where we assume zero difference between the two.

2.1 Load the data and examine its structure

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Aveng…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", …
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorro…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 2…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, …
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08,…
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08,…
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 92…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, …
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 3…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2,…
# produce the data we will use today
Burton_Spielberg <- movies %>% 
  filter(director %in% c("Steven Spielberg", "Tim Burton"))

2.2 Reproducing the graph

g1 <- Burton_Spielberg %>% 
  group_by(director) %>% 
  summarize(n = n(), 
            mean_rating = mean(rating, na.rm = TRUE), 
            SD_directors = sd(rating, na.rm = TRUE),
            SE_directors = SD_directors/sqrt(n),
            t_critical = qt(0.975, n-1),
            lower95_ci = mean_rating - t_critical*SE_directors,
            upper95_ci = mean_rating + t_critical*SE_directors)

g1 %>% 
  ggplot(mapping = aes(x = mean_rating, y= reorder(director, mean_rating)))+
  geom_rect(aes(xmin = max(lower95_ci), xmax = min(upper95_ci), ymin = -Inf, ymax = Inf), alpha = 0.1, fill = "black") +
  geom_point(aes(color = as.factor(director)), show.legend = FALSE, size = 5) +
  geom_errorbar(aes(xmin = lower95_ci, xmax = upper95_ci, color = as.factor(director)), width = 0.1, show.legend = FALSE, size = 1.8) +
  theme_minimal()+
  labs(x = "Mean IMDB Rating",
       y = "",
       title = "Do Spielberg and Burton have the same mean IMDB rating?",
       subtitle = "95% confidence intervals overlap"
       )+
  geom_text(aes(label = round(mean_rating,2)), vjust=-1, size=6)+
  geom_text(aes(x=upper95_ci, label = round(upper95_ci,2)), vjust=-1, size=5)+
  geom_text(aes(x=lower95_ci, label = round(lower95_ci,2)), vjust=-1, size=5)+
  NULL

2.3 Hypothesis test with formula

  • Null hypothesis: The mean IMDB ratings for Steven Spielberg and Tim Burton are the same
  • Alternative hypothesis: The mean IMDB ratings for Steven Spielberg and Tim Burton are different
  • p-value: if the p-value is lower than 5%, we reject H0 and conclude that the mean IMDB ratings for Steven Spielberg and Tim Burton are different
t.test(rating ~ director, data = Burton_Spielberg)
## 
##  Welch Two Sample t-test
## 
## data:  rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.16 1.13
## sample estimates:
## mean in group Steven Spielberg       mean in group Tim Burton 
##                           7.57                           6.93

2.4 Hypothesis test with infer

  • Null hypothesis: The mean IMDB ratings for Steven Spielberg and Tim Burton are the same
  • Alternative hypothesis: The mean IMDB ratings for Steven Spielberg and Tim Burton are different
  • p-value: if the p-value is lower than 5%, we reject H0 and conclude that the mean IMDB ratings for Steven Spielberg and Tim Burton are different
# initialize the test, which we save as `directors_diff`
directors_diff <- Burton_Spielberg %>%
    specify(rating ~ director) %>%
  
    # the statistic we are searching for is the difference in means, with the order being "Steven Spielberg", "Tim Burton"
    calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))

# simulate the test on the null distribution, which we save as null
directors_null_dist <- Burton_Spielberg %>%
    specify(rating ~ director) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 1000, type = "permute") %>%
    calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))

# Visualising to see how many of these null permutations have a difference
directors_null_dist %>% visualize() +
  shade_p_value(obs_stat = directors_diff, direction = "two-sided")

# Calculating the p-value for the hypothesis test
directors_null_dist %>%
  get_p_value(obs_stat = directors_diff, direction = "two_sided")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1    0.03

From the hypothesis tests, we can see that the t-stat is 3 and the p-value (1% from the t.test, 3% from the simulation) is smaller than 5%, so we reject the null hypothesis and conclude that the difference in mean IMDB ratings for Steven Spielberg and Tim Burton is statistically significant; specifically, the mean rating for Steven Spielberg is higher. This is just what we expected, as we all prefer Steven Spielberg.

3 Omega Group plc- Pay Discrimination

At the last board meeting of Omega Group Plc., the headquarters of a large multinational company, the issue was raised that women were being discriminated against in the company, in the sense that salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (24 men and 26 women) revealed that the average salary for men was about $8,700 higher than for women. This seemed like a considerable difference, so it was decided that a further analysis of the company salaries was warranted.

You are asked to carry out the analysis. The objective is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.

3.1 Loading the data

omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame
## Rows: 50
## Columns: 3
## $ salary     <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 63…
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", "m…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, …

3.2 Relationship Salary - Gender ?

The data frame omega contains the salaries for the sample of 50 executives in the company. Can you conclude that there is a significant difference between the salaries of the male and female executives?

omega %>%
  group_by(gender) %>%
  summarise(mean_salary = mean(salary),
            sd_salary = sd(salary),
            count2 = n(),
            se_salary = sd_salary/sqrt(count2),
            t_critical2 = qt(0.975, count2-1),
            margin_of_error2 = t_critical2*se_salary,
            lower_ci = mean_salary - t_critical2*se_salary,
            upper_ci = mean_salary + t_critical2*se_salary)
## # A tibble: 2 x 9
##   gender mean_salary sd_salary count2 se_salary t_critical2 margin_of_error2
##   <chr>        <dbl>     <dbl>  <int>     <dbl>       <dbl>            <dbl>
## 1 female      64543.     7567.     26     1484.        2.06            3056.
## 2 male        73239.     7463.     24     1523.        2.07            3151.
## # … with 2 more variables: lower_ci <dbl>, upper_ci <dbl>

There is an observed difference of about $8,696 in salary, and we notice that the two confidence intervals do not overlap. The difference appears statistically significant at the 95% confidence level. Let us also conduct a hypothesis test.

# hypothesis testing using t.test() 
t.test(salary ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12973  -4420
## sample estimates:
## mean in group female   mean in group male 
##                64543                73239
# hypothesis testing using infer package

# initialize the test, which we save as `salary_diff`
salary_diff <- omega %>%
    specify(salary ~ gender) %>%
  
    # the statistic we are searching for is the difference in means, with the order being "male", "female"
    calculate(stat = "diff in means", order = c("male", "female"))

# simulate the test on the null distribution, which we save as null
salary_null_dist <- omega %>%
    specify(salary ~ gender) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 1000, type = "permute") %>%
    calculate(stat = "diff in means", order = c("male", "female"))

# Visualising to see how many of these null permutations have a difference
salary_null_dist %>% visualize() +
  shade_p_value(obs_stat = salary_diff, direction = "two-sided")

# Calculating the p-value for the hypothesis test
salary_null_dist %>%
  get_p_value(obs_stat = salary_diff, direction = "two_sided")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

The test result shows that the p-value is very low, smaller than 1%, so we reject the null hypothesis and conclude the difference in mean salaries between genders is statistically significant. Let us look at the correlation analysis.

# make gender a binary variable so that a correlation analysis is possible (male = 1, female = 0)
omega2 <- omega %>%
  mutate(gender_num = ifelse(gender == "male",1,0))

cor(salary ~ gender_num, data = omega2, method = "pearson")
## [1] 0.508

The correlation coefficient of 0.508 indicates that there is indeed a positive correlation between the gender of the executives and their salary. Let us now perform a linear regression.

linear_regression <- lm(salary ~ gender_num, data=omega2)
print(linear_regression)
## 
## Call:
## lm(formula = salary ~ gender_num, data = omega2)
## 
## Coefficients:
## (Intercept)   gender_num  
##       64543         8696
summary(linear_regression)
## 
## Call:
## lm(formula = salary ~ gender_num, data = omega2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18471  -4780    127   5484  14257 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    64543       1474   43.78  < 2e-16 ***
## gender_num      8696       2128    4.09  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7520 on 48 degrees of freedom
## Multiple R-squared:  0.258,  Adjusted R-squared:  0.243 
## F-statistic: 16.7 on 1 and 48 DF,  p-value: 0.000165

The results of our linear regression model unsurprisingly convey the same conclusion: a statistically significant difference in salaries between the genders, with the same $8,696 difference observed in the confidence interval analysis.
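That the regression slope reproduces the group difference exactly is no coincidence: with a single 0/1 regressor, OLS fits the two group means, so the intercept is the baseline (female) mean and the slope is the male-female difference. A tiny toy check (made-up numbers, not the omega data):

```r
# With a binary 0/1 regressor, OLS recovers the group means exactly:
# intercept = mean of the 0-group, slope = difference in group means
g <- c(0, 0, 0, 1, 1)       # toy group indicator
y <- c(10, 12, 14, 20, 24)  # toy outcome
fit <- lm(y ~ g)
coef(fit)  # intercept = 12 (mean of 0-group), slope = 10 (= 22 - 12)
```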

In conclusion, we observed a significant gap in salaries between men and women at Omega. However, we cannot yet conclude that this gap is a result of gender discrimination, as there may be other factors, which we will investigate later, that affect this relationship. Apart from logic, the other hint that our relationship may suffer from omitted variable bias is the low R-squared of around 25%, which, although not a perfect measure, suggests that only about 25% of the variation in salary can be explained by gender in this dataset.
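The omitted-variable concern can be illustrated with simulated data (entirely made up, not the omega sample): if experience drives salary and also differs by gender, a regression that omits experience attributes the experience gap to gender.

```r
# Simulated illustration of omitted variable bias (not the omega data):
# salary depends only on experience, but experience differs by gender
set.seed(42)
n <- 500
gender_num <- rbinom(n, 1, 0.5)                         # 1 = male, 0 = female
experience <- 7 + 14 * gender_num + rnorm(n, 0, 5)      # men ~14 more years on average
salary     <- 50000 + 600 * experience + rnorm(n, 0, 3000)  # no direct gender effect

naive    <- coef(lm(salary ~ gender_num))["gender_num"]               # absorbs the experience gap
adjusted <- coef(lm(salary ~ gender_num + experience))["gender_num"]  # close to zero
c(naive = unname(naive), adjusted = unname(adjusted))
```

Once experience is controlled for, the "gender" coefficient in this simulation collapses toward zero; the sections below investigate whether the omega data behave similarly.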

3.3 Relationship Experience - Gender?

At the board meeting, someone raised the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire sent out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women have only about 7 years of experience on average (see table below).

# Summary statistics of experience by gender
favstats (experience ~ gender, data=omega)
##   gender min    Q1 median   Q3 max  mean    sd  n missing
## 1 female   0  0.25    3.0 14.0  29  7.38  8.51 26       0
## 2   male   1 15.75   19.5 31.2  44 21.12 10.92 24       0

Based on this evidence, can you conclude that there is a significant difference between the experience of the male and female executives? Perform similar analyses as in the previous section. Does your conclusion validate or endanger your conclusion about the difference in male and female salaries?

Firstly, let us look at the confidence intervals again

omega %>%
  group_by(gender) %>%
  summarise(mean_exp = mean(experience),
            sd_exp = sd(experience),
            count = n(),
            se_exp = sd_exp/sqrt(count),
            t_critical = qt(0.975, count-1),
            margin_of_error = t_critical*se_exp,
            lower_ci = mean_exp - t_critical*se_exp,
            upper_ci = mean_exp + t_critical*se_exp)
## # A tibble: 2 x 9
##   gender mean_exp sd_exp count se_exp t_critical margin_of_error lower_ci
##   <chr>     <dbl>  <dbl> <int>  <dbl>      <dbl>           <dbl>    <dbl>
## 1 female     7.38   8.51    26   1.67       2.06            3.44     3.95
## 2 male      21.1   10.9     24   2.23       2.07            4.61    16.5 
## # … with 1 more variable: upper_ci <dbl>

As stated above, there is a difference of about 14 years in experience between male and female employees in our dataset, and we notice that the two confidence intervals do not overlap. The difference appears statistically significant at the 95% confidence level. Let us also conduct a hypothesis test.
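As with weight earlier, the interval for the difference itself can be sketched in base R from the rounded favstats summary above; the Welch degrees of freedom (43) are taken from the t.test output below, so the endpoints only approximately match the exact test.

```r
# 95% CI for the difference in mean experience (male - female),
# built from the rounded per-group summary statistics above
m_f <- 7.38;  s_f <- 8.51;  n_f <- 26
m_m <- 21.12; s_m <- 10.92; n_m <- 24

diff_means <- m_m - m_f                           # ~13.74 years
se_diff    <- sqrt(s_f^2 / n_f + s_m^2 / n_m)     # Welch standard error
ci <- diff_means + c(-1, 1) * qt(0.975, 43) * se_diff
round(ci, 2)   # roughly (8.12, 19.36); zero is far outside the interval
```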

# hypothesis testing using t.test()

t.test(experience ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.35  -8.13
## sample estimates:
## mean in group female   mean in group male 
##                 7.38                21.12
# hypothesis testing using infer package

# initialize the test, which we save as `exp_diff`
exp_diff <- omega %>%
    specify(experience ~ gender) %>%
  
    # the statistic we are searching for is the difference in means, with the order being "male", "female"
    calculate(stat = "diff in means", order = c("male", "female"))

# simulate the test on the null distribution, which we save as null
exp_null_dist <- omega %>%
    specify(experience ~ gender) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 1000, type = "permute") %>%
    calculate(stat = "diff in means", order = c("male", "female"))

# Visualising to see how many of these null permutations have a difference
exp_null_dist %>% visualize() +
  shade_p_value(obs_stat = exp_diff, direction = "two-sided")

# Calculating the p-value for the hypothesis test
exp_null_dist %>%
  get_p_value(obs_stat = exp_diff, direction = "two_sided")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

The test result shows that the p-value is very low, smaller than 1%, so we reject H0 and conclude that the mean years of experience differ between genders. Let us look at the correlation analysis.

cor(experience ~ gender_num, data = omega2, method = "pearson")
## [1] 0.584

The correlation coefficient of 0.584 indicates that there is indeed a positive correlation between the gender of the executives and their years of experience. We also find that the correlation between experience and gender is higher than the correlation between salary and gender. Let us now perform a linear regression.

linear_regression2 <- lm(experience ~ gender_num, data=omega2)
print(linear_regression2)
## 
## Call:
## lm(formula = experience ~ gender_num, data = omega2)
## 
## Coefficients:
## (Intercept)   gender_num  
##        7.38        13.74
summary(linear_regression2)
## 
## Call:
## lm(formula = experience ~ gender_num, data = omega2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20.12  -6.32  -2.25   8.37  22.88 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.38       1.91    3.87  0.00033 ***
## gender_num     13.74       2.76    4.98  8.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.74 on 48 degrees of freedom
## Multiple R-squared:  0.341,  Adjusted R-squared:  0.327 
## F-statistic: 24.8 on 1 and 48 DF,  p-value: 8.51e-06

Looking at the results of our linear regression model, we can see that both p-values are well below 1% and therefore indicate a statistical significance of this regression model. However, both our R-squared and adj. R-squared values only amount to 0.341 and 0.327 respectively. In other words, only c. 34% of the variance for our dependent variable (experience) can be explained by our independent variable (gender).

Concluding, we can state that there is indeed a significant difference in years of experience between men and women in our dataset. Tying this to our previous findings about the relationship between salary and gender: although there is a difference in salaries between the genders, the significant difference in levels of experience potentially undermines the conclusion that the salary gap is a result of gender discrimination, as it may simply reflect differences in experience. To conclude that, however, we need to understand the relationship between salary and experience and check whether it is the same for both genders, i.e., that both genders are compensated equally for additional experience.
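One way to test whether both genders are compensated equally for experience is an interaction model. The sketch below reuses the omega2 frame defined earlier (with gender_num as the 0/1 indicator); the interaction term lets the experience-salary slope differ by gender, and its coefficient estimates that difference in slopes. This is a sketch only; we do not report its output here.

```r
# Does an extra year of experience pay the same for both genders?
# experience:gender_num estimates the male-female difference in slopes;
# a small, insignificant interaction would be consistent with equal
# compensation for additional experience
model_interaction <- lm(salary ~ experience * gender_num, data = omega2)
summary(model_interaction)
```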

3.4 Relationship Salary - Experience ?

Someone at the meeting argues that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.

Analyse the relationship between salary and experience. Draw a scatterplot to visually inspect the data

omega %>% 
ggplot(mapping = aes(x= experience, y = salary, color = gender)) +
  geom_point()+
  labs(title = "Greater experience is associated with higher salaries",
       subtitle = "Male executives at Omega are more experienced",
       x = "Years of experience",
       y = "Salary in USD")

cor(experience ~ salary, data = omega, method = "pearson")
## [1] 0.803
t.test(experience ~ gender, data = omega)
## 
##  Welch Two Sample t-test
## 
## data:  experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.35  -8.13
## sample estimates:
## mean in group female   mean in group male 
##                 7.38                21.12

3.5 Check correlations between the data

You can use GGally::ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order in which our variables appear and place the dependent variable (Y), salary, last in our list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make the plots somewhat transparent (alpha = 0.3).

omega %>% 
  select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
  ggpairs(aes(colour=gender, alpha = 0.3))+
  theme_bw()

The scatterplot between experience and salary shows that there are 7 female executives who have just been hired into their role and have lower wages, and a further 7 with less than 5 years of experience who also earn less than the more experienced male executives. Conversely, the most experienced executives, with 30+ years at the company and among the best compensated, are exclusively male. Furthermore, from the right middle panel we gather that experience has a greater correlation with wages for female executives than for their male counterparts, perhaps hinting at other factors that play a role in compensation, such as the size of each executive's team or their geographic coverage.

In conclusion, when discussing pay discrimination at this company, the wage disparity seems to stem from the fact that historically the company only had male executives. The company has since made an effort to hire predominantly women into executive positions; as the most experienced (male) executives retire and the generally less experienced female executives gain years of experience, their compensation should converge. The real test at Omega will therefore be repeating this analysis in 10 years' time.

4 Challenge 1: Yield Curve inversion

At the end of this challenge, we will produce a chart that uses the FRED database to download historical yield curve rates and plot the yield curves since 1999.

# Get a list of FRED codes for US rates and US yield curve; choose monthly frequency
# to see, eg., the 3-month T-bill https://fred.stlouisfed.org/series/TB3MS
tickers <- c('TB3MS', # 3-month Treasury bill (or T-bill)
             'TB6MS', # 6-month
             'GS1',   # 1-year
             'GS2',   # 2-year, etc....
             'GS3',
             'GS5',
             'GS7',
             'GS10',
             'GS20',
             'GS30')  #.... all the way to the 30-year rate

# Turn  FRED codes to human readable variables
myvars <- c('3-Month Treasury Bill',
            '6-Month Treasury Bill',
            '1-Year Treasury Rate',
            '2-Year Treasury Rate',
            '3-Year Treasury Rate',
            '5-Year Treasury Rate',
            '7-Year Treasury Rate',
            '10-Year Treasury Rate',
            '20-Year Treasury Rate',
            '30-Year Treasury Rate')

maturity <- c('3m', '6m', '1y', '2y','3y','5y','7y','10y','20y','30y')

# by default R will sort these maturities alphabetically; but since we want
# to keep them in that exact order, we recast maturity as a factor 
# or categorical variable, with the levels defined as we want
maturity <- factor(maturity, levels = maturity)

# Create a lookup dataset
mylookup<-data.frame(symbol=tickers,var=myvars, maturity=maturity)
# Take a look:
mylookup %>% 
  knitr::kable()
symbol var maturity
TB3MS 3-Month Treasury Bill 3m
TB6MS 6-Month Treasury Bill 6m
GS1 1-Year Treasury Rate 1y
GS2 2-Year Treasury Rate 2y
GS3 3-Year Treasury Rate 3y
GS5 5-Year Treasury Rate 5y
GS7 7-Year Treasury Rate 7y
GS10 10-Year Treasury Rate 10y
GS20 20-Year Treasury Rate 20y
GS30 30-Year Treasury Rate 30y
df <- tickers %>% tidyquant::tq_get(get="economic.data", 
                   from="1960-01-01")   # start from January 1960

glimpse(df)
## Rows: 6,774
## Columns: 3
## $ symbol <chr> "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS",…
## $ date   <date> 1960-01-01, 1960-02-01, 1960-03-01, 1960-04-01, 1960-05-01, 1…
## $ price  <dbl> 4.35, 3.96, 3.31, 3.23, 3.29, 2.46, 2.30, 2.30, 2.48, 2.30, 2.…

This dataframe df has three columns (variables):

  • symbol: the FRED database ticker symbol
  • date: already a date object
  • price: the actual yield on that date
#Join this dataframe 'df' with dataframe 'mylookup' to have a more readable version
yield_curve <-left_join(df,mylookup,by="symbol") 

4.0.1 Plotting Yields on US rates by duration since 1960

yield_curve %>% 
  mutate(var=factor(var, levels=myvars)) %>% #reuse `myvars` (defined above) to keep maturities in order
  #plotting dates on x-axis and yield on y-axis with color represented by maturity
    ggplot(mapping = aes(x=date, y=price, color = var), show.legend = FALSE) +
  geom_line(show.legend = FALSE) +
  #to facet by the maturity and have 2 columns for the graphs
  facet_wrap(~var,ncol=2) +
  #titles for the plot
  labs(title = "Yields on US Treasury rates since 1960",
       x = "",
       y = "%",
       caption = "Source: St. Louis Federal Reserve Economic Database (FRED)") +
  #adding a theme
  theme_bw() +
  #no legend
  theme(legend.position = "none") 

4.0.2 Monthly yields on US rates by duration since 1999 on a year-by-year basis

yield_curve %>% 
  #filter to show treasuries in and after 1999
  filter(year(date)>=1999) %>% 
  #plotting maturity on x axis and yield on y axis, and it is grouped by month colored by year
  ggplot(aes(
          x = maturity,
          y = price,
          group = month(date),
          color = as.factor(year(date)))) +
  geom_line() +
  #faceted by year with graphs in 4 columns
  facet_wrap(~ year(date), ncol = 4) +
  theme_bw() +
  #labelling the axes and titles
  labs(title = "US Yield Curve",
       x = "Maturity",
       y = "Yield (%)",
       caption = "Source: St. Louis Federal Reserve Economic Database (FRED)") +
  #no legend
  theme(legend.position = "none",
        axis.text.x=element_text(size=7))

4.0.3 3-month and 10-year yields since 1999

yield_curve %>% 
  #filtering to have treasuries in or after 1999 and only 3 month and 10 year bills and rates
  filter(year(date) >= 1999, 
         maturity %in% c("3m","10y")) %>% #use %in%, not ==, so every row for both maturities is kept
  #plotting with years on x-axis and yields on y-axis colored according to maturities
  ggplot(mapping = 
           aes(x = date, 
             y = price, 
             group = var,
             color = var)) +
    geom_line() +
  #to change position of the items in the legend
    scale_color_discrete(breaks = c("3-Month Treasury Bill", 
                                  "10-Year Treasury Rate")) +
  #labelling the axes and title
  labs(title = "Yields on 3-Month and 10-Year Treasury Rates since 1999",
       x = "",
       y = "%",
       caption = "Source: St. Louis Federal Reserve Economic Database (FRED)") +
  theme_bw() +
  #no title for the legend
  theme(legend.title = element_blank())

4.0.4 Yield curve inversion

According to Wikipedia’s list of recessions in the United States, since 1999 there have been two recessions in the US: between Mar 2001–Nov 2001 and between Dec 2007–June 2009. Does the yield curve seem to flatten before these recessions?

Yes, it appears that the yield curve flattens, and then inverts, before both of these recessions.

Can a yield curve flattening really mean a recession is coming in the US? According to the following graph, a recession followed wherever the yield curve flattened. While it cannot be taken as the only factor, it should be treated as a warning signal.

Since 1999, when did short-term (3 months) yield more than longer term (10 years) debt?

The following graph shows the nine occasions on which this happened; we calculate the spread (10-year minus 3-month) and label it ‘diff’.

# get US recession dates after 1946 from Wikipedia 
# https://en.wikipedia.org/wiki/List_of_recessions_in_the_United_States

#creates a dataframe with all US recessions since 1946
recessions <- tibble(
  from = c("1948-11-01", "1953-07-01", "1957-08-01", "1960-04-01", "1969-12-01", "1973-11-01", "1980-01-01","1981-07-01", "1990-07-01", "2001-03-01", "2007-12-01"),  
  to = c("1949-10-01", "1954-05-01", "1958-04-01", "1961-02-01", "1970-11-01", "1975-03-01", "1980-07-01", "1982-11-01", "1991-03-01", "2001-11-01", "2009-06-01") 
  )  %>% 
  mutate(From = ymd(from), 
         To=ymd(to),
         duration_days = To-From)

recessions
## # A tibble: 11 x 5
##    from       to         From       To         duration_days
##    <chr>      <chr>      <date>     <date>     <drtn>       
##  1 1948-11-01 1949-10-01 1948-11-01 1949-10-01 334 days     
##  2 1953-07-01 1954-05-01 1953-07-01 1954-05-01 304 days     
##  3 1957-08-01 1958-04-01 1957-08-01 1958-04-01 243 days     
##  4 1960-04-01 1961-02-01 1960-04-01 1961-02-01 306 days     
##  5 1969-12-01 1970-11-01 1969-12-01 1970-11-01 335 days     
##  6 1973-11-01 1975-03-01 1973-11-01 1975-03-01 485 days     
##  7 1980-01-01 1980-07-01 1980-01-01 1980-07-01 182 days     
##  8 1981-07-01 1982-11-01 1981-07-01 1982-11-01 488 days     
##  9 1990-07-01 1991-03-01 1990-07-01 1991-03-01 243 days     
## 10 2001-03-01 2001-11-01 2001-03-01 2001-11-01 245 days     
## 11 2007-12-01 2009-06-01 2007-12-01 2009-06-01 548 days
#creating a new dataframe called g1
g1 <- yield_curve %>% 
  #choosing date, price, and symbol from existing dataframe
    select(date,price,symbol) %>% 
  #a pivot wider to take values from price column and names from symbol column
  pivot_wider(names_from = symbol, values_from = price) %>% 
  #create a new column called diff that calculates the spread
  mutate(diff = GS10 - TB3MS)


g1 %>%
  #plotting the graph using spread in y-axis and year in x-axis
  ggplot(aes(x = date, y = diff)) +
  #creating grey boxes using dataframe recessions to indicate the periods of recessions
  geom_rect(data = recessions, 
            inherit.aes = FALSE, 
            aes(xmin = From, 
                xmax = To, 
                ymin = -Inf, 
                ymax = Inf),
            alpha = 1, 
            fill = "grey")+
  geom_line() +
  #creating a line where y=0
  geom_hline(yintercept = 0) +
  #indicating color of graph (blue when it is above x-axis and red when it is below)
  geom_ribbon(aes(ymin = 0,
                  ymax = pmax(diff, 0)),
              fill = "#A6BDDB", alpha = 0.5) +
  geom_ribbon(aes(ymin = pmin(diff, 0),
                  ymax = 0),
              fill = "#FC787E", alpha = 0.5) +
  #creating the lines at the bottom and filling the appropriate color
  geom_rug(sides = "b", colour = case_when(g1$diff >= 0 ~ "#A6BDDB",
                                            g1$diff < 0 ~ "#FC787E")) +
  #changing the x-axis scale
  scale_x_date(date_labels="%Y", date_breaks  = "2 years",  limits= c(as.Date("1959-01-01"), as.Date("2023-01-01"))) +
  #labelling the axes and title
  labs(title = "Yield Curve Inversion 10-Year minus 3-Month US Treasury Rates",
       subtitle = "Difference in % points, monthly averages.\nShaded areas correspond to recessions",
       x = "",
       y = "Difference (10-Year - 3-Month) Yield in %",
       caption = "Source: FRED, Federal Reserve Bank of St. Louis") +
  theme_minimal() +
  #no legend
  theme(axis.text.x=element_text(size=5),
        legend.position = "none",
        plot.title=element_text(face="bold"),   #making the title bold
        plot.subtitle=element_text(face="italic", size = 8)   #making the subtitle italic
        ) 
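As a numerical companion to the chart, a sketch (assuming the `g1` dataframe built above and lubridate's `year()`) can count the separate inversion episodes since 1999, where an episode is a maximal run of consecutive months with a negative spread:

```r
library(dplyr)
library(lubridate)

# Sketch: count distinct inversion episodes since 1999.
# Run-length encoding of the logical "inverted this month" flag turns
# each stretch of consecutive TRUE months into a single run.
g1_recent <- g1 %>%
  filter(year(date) >= 1999) %>%
  arrange(date)

runs <- rle(g1_recent$diff < 0)  # run-length encode the inversion flag
sum(runs$values)                 # number of TRUE runs = number of episodes
```

This treats any interruption, even a single non-inverted month, as ending an episode, so it may count slightly more episodes than a visual reading of the chart.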

5 Challenge 2: GDP components over time and among countries

The main components of gross domestic product (GDP) are personal consumption (C), business investment (I), government spending (G) and net exports (exports minus imports). The GDP data we looked at is from the United Nations’ National Accounts Main Aggregates Database. We looked at how GDP and its components have changed over time, and compared across countries how much each component contributes to that country’s GDP. The file we worked with is GDP and its breakdown at constant 2010 prices in US Dollars.

UN_GDP_data  <-  read_excel(here::here("data", "Download-GDPconstant-USD-countries.xls"), # Excel filename
                sheet="Download-GDPconstant-USD-countr", # Sheet name
                skip=2) # Number of rows to skip

We first tidied the data, converting it from wide to long, tidy format. We expressed all figures in billions and renamed the indicators to something shorter.

tidy_GDP_data  <-  UN_GDP_data %>%
  #Convert wide format to long format
  pivot_longer(cols = 4:51,
               names_to = "year",
               values_to = "value") %>% 
  #Divide GDP value by 10^9, and create a column with shortened indicator names, respectively.
  mutate(value = value/1e9,
         year = as.numeric(year),
         indicator = case_when(
           IndicatorName == "Household consumption expenditure (including Non-profit institutions serving households)" ~ "C",
           IndicatorName == "Gross Domestic Product (GDP)" ~ "GDP",
           IndicatorName == "Imports of goods and services" ~ "Imports",
           IndicatorName == "Exports of goods and services" ~ "Exports",
           IndicatorName == "General government final consumption expenditure" ~ "G",           
           IndicatorName == "Gross capital formation" ~ "I"
           )) %>% 
  #Remove country_id and indicator_name
  select(-1,-3) %>% 
  filter(!is.na(indicator)) %>% 
  clean_names()

glimpse(tidy_GDP_data)
## Rows: 63,072
## Columns: 4
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",…
## $ year      <dbl> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979,…
## $ value     <dbl> 5.07, 4.84, 4.70, 5.21, 5.59, 5.65, 5.68, 6.15, 6.30, 6.15,…
## $ indicator <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",…
# Let us compare GDP components for these 3 countries
country_list <- c("United States","India", "Germany")

tidy_GDP_data %>%
  #Choose only the countries we are interested in
  filter(country %in% country_list, !indicator=="GDP") %>%
  #Plot the lines of indicators we are interested in
  ggplot(mapping = aes(x=year, y=value, group = indicator, color = indicator)) +
  geom_line() +
  #Draw the plot for each country
  facet_wrap(~country)+
  #Set mark points for x and y axes
  scale_y_continuous(breaks = seq(0, 12500, 2500))+
  scale_x_continuous(breaks = c(1970, 1980, 1990, 2000, 2010))+
  scale_color_discrete(name="Components of GDP",
                      breaks=c("I", "Exports", "G", "C", "Imports"),
                      labels=c("Gross Capital formation", "Exports", "Government expenditure", "Household expenditure", "Imports"))+
  theme_bw()+
  coord_fixed(ratio = 0.007)+
  theme(plot.margin=grid::unit(c(0,0,0,0), "mm"),
        legend.margin=unit(.2,"inches"))+
  labs(x = "",
       y= "Billion US$",
       title = "GDP components over time",
       subtitle = "In constant 2010 USD")+
  NULL

Also, GDP is the sum of Household Expenditure (consumption, C), Gross Capital Formation (business investment, I), Government Expenditure (G) and Net Exports (exports minus imports). Even though the dataframe already contains a Gross Domestic Product (GDP) indicator, we calculated it from these components.

tidy_GDP_data %>%
  #Choose the countries we are interested in
  filter(country %in% country_list) %>%
  pivot_wider(values_from = "value",
              names_from = "indicator") %>% 
  clean_names() %>% 
  #Create new columns with the shares of Household Expenditure (C), Gross Capital Formation (I), Government Expenditure (G) and Net Exports (exports - imports) in the calculated GDP, plus the percentage error between calculated and reported GDP.
  mutate(calculated_gdp = c+g+i+exports-imports,
         c_pro = c/calculated_gdp,
         i_pro = i/calculated_gdp,
         g_pro = g/calculated_gdp,
         netex_pro = (exports - imports)/calculated_gdp,
         percent_error = 1-calculated_gdp/gdp) %>% 
  
  #plot the items mentioned above, separated by countries, set mark points and end points
  select("country", "year", "c_pro", "i_pro", "g_pro", "netex_pro", "percent_error") %>%
  pivot_longer(cols = 3:7,
               names_to = "indicator",
               values_to = "value") %>% 
  ggplot(mapping = aes(x=year, y=value, group = indicator, color = indicator)) +
  geom_line() +
  facet_wrap(~country)+
  scale_y_continuous(limits = c(-0.2, 0.75), breaks = seq(-0.2, 0.6, 0.2), labels = scales::percent)+
  scale_x_continuous(breaks = c(1970, 1980, 1990, 2000, 2010))+
  scale_color_discrete(name="",
                      breaks=c("g_pro", "i_pro", "c_pro", "netex_pro","percent_error"),
                      labels=c("Government expenditure", "Gross Capital formation", "Household expenditure", "Net exports", "Percentage error between reported and calculated GDP"))+
  theme_bw()+
    labs(title = "GDP and its breakdown at constant 2010 prices in US dollars",
               x= "",
               y="proportion") +
  coord_fixed(ratio = 60)+
  theme(plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"),
        legend.margin=unit(.2,"inches"))


We added a line to show the percentage error between reported GDP and calculated GDP, and observed from the plots that this error ranges from roughly 0 to 5%.
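The 0–5% range can also be checked directly rather than read off the chart; a sketch (assuming `tidy_GDP_data` and `country_list` from above) summarises the largest absolute error per country:

```r
library(dplyr)
library(tidyr)
library(janitor)

# Sketch: largest absolute gap between reported and calculated GDP, by country
tidy_GDP_data %>%
  filter(country %in% country_list) %>%
  pivot_wider(names_from = indicator, values_from = value) %>%
  clean_names() %>%
  mutate(percent_error = 1 - (c + g + i + exports - imports) / gdp) %>%
  group_by(country) %>%
  summarise(max_abs_error = max(abs(percent_error), na.rm = TRUE))
```

A maximum around 0.05 for each country would confirm the reading of the chart.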

In addition, the last chart tells us a few things:

1) Germany’s share of net exports in GDP increased over 2000–2017, while India and the USA mostly ran negative net exports. One possible reason is that Germany is a leader in industrials, producing higher-value products, which keeps its net exports positive.

2) India and the USA have a higher share of household expenditure in GDP than Germany, with India’s decreasing and the USA’s increasing. A high household-expenditure share indicates strong domestic market demand, which may help explain why the USA and India have higher economic growth rates than Germany.

3) India and Germany have a higher share of gross capital formation in GDP than the USA, with India’s growing significantly over 2000–2017. One possible explanation is that India’s lower labour costs, relative to Germany and the USA, attracted substantial business investment during that period.

4) The USA and Germany have a higher share of government expenditure in GDP than India. This might mean that the US and German governments are in a better financial condition than India’s.

6 Details

  • Who did you collaborate with: Bartek Mozdzen, Disha Sethi, Marius Dippe, Danya Liu, Haopeng Zhou, Aris Georgakopoulos
  • Approximately how much time did you spend on this problem set: A fair amount of time
  • What, if anything, gave you the most trouble: Challenge 1